Are Transformers All That Karel Needs? 
Abstract


Recent works have shown the promise of using neural networks for the task of 
program synthesis from input-output examples. The Karel dataset has been a 
benchmark for evaluating program synthesis approaches. Several techniques have 
been proposed to use neural guided program synthesis with Karel being used as a 
baseline. Most of these techniques use an LSTM based model for decoding and 
improve performance by proposing complex algorithmic additions, such as using 
inferred execution traces, latent execution of partial programs and debugging generated
programs. We observe that by changing the base architecture to a transformer
based one, specifically GPT2, we are able to apply simple execution guidance on
top to achieve a generalization accuracy of 89.64%, which is within 2.36 percentage
points of the current state-of-the-art on Karel which uses ensembling. 


1 Introduction 


Programming by example is a program generation task in which the generated program must satisfy a
specification given in the form of input-output pairs. Advances in neural networks have spurred
interest in using deep learning techniques to solve problems in the programming languages space.
While domains such as performing arithmetic operations, string manipulation, list sorting, query
generation, etc. have been studied in the literature, generating programs for languages with more complex
structures like looping and conditionals is a harder problem to solve.


The Karel programming language was used in the 1980s as an introductory language by Stanford
CS students. Its DSL is aimed at controlling a robot in a grid space, where it must navigate to pick and
place markers while avoiding the walls. The DSL specification (Fig. 1) has support for conditional
and looping constructs, which make it an ideal test bed for evaluating program synthesis approaches.
Devlin et al. [10] introduced the Karel dataset for performing neural program induction. Their task
was to learn to represent a program by training on input-output examples and predicting the
output of a program given a new input grid. Our work is focused on program synthesis, whereby we
aim to generate a program given the input-output pairs obtained on executing the program. Bunel
et al. [3] introduced a neural approach for program synthesis using the Karel dataset as a benchmark.
They used 5 IO pairs as an input to the model to synthesize the program, while using the 6th pair as a
held out test sample to evaluate the generalization performance.


Since then, a number of approaches have been proposed and studied to improve performance on
the Karel program synthesis task. These include inferring trace executions [32], using execution
based decoding [6], learning to execute latent programs and to repair programs after decoding
[16]. These methods focus on making complex changes to how the input and output of the model are
processed, while keeping the base architecture as an LSTM. We posit that transformer based models
should be able to achieve much better base performance on Karel, which can be improved further by
adding a relatively simple execution guided search.
Prog p   := def run() : s
Stmt s   := while(b) : s | repeat(r) : s | s1 ; s2 | a
          | if(b) : s | ifelse(b) : s1 else : s2
Cond b   := frontIsClear() | leftIsClear() | rightIsClear()
          | markersPresent() | noMarkersPresent() | not b
Action a := move() | turnRight() | turnLeft()
          | pickMarker() | putMarker()
Cste r   := 0 | 1 | ... | 19


Figure 1: Karel DSL specification. Figure from Devlin et al. [10].
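
To make the grammar in Figure 1 concrete, the following is a minimal sketch of an interpreter for it. It is not the interpreter used with the Karel dataset. The `World` class, the nested-tuple program encoding, the step budget, and the choice to silently ignore a move into a wall or a pick on an empty cell are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Headings in clockwise order as (d_row, d_col): north, east, south, west.
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]


@dataclass
class World:
    """Hypothetical grid world: robot pose, wall cells, marker counts."""
    rows: int
    cols: int
    pos: tuple = (0, 0)
    heading: int = 1  # index into HEADINGS; 1 means east
    walls: set = field(default_factory=set)
    markers: dict = field(default_factory=dict)

    def is_clear(self, turn):
        """True if the neighbouring cell, `turn` quarter-turns clockwise, is free."""
        d_row, d_col = HEADINGS[(self.heading + turn) % 4]
        row, col = self.pos[0] + d_row, self.pos[1] + d_col
        inside = 0 <= row < self.rows and 0 <= col < self.cols
        return inside and (row, col) not in self.walls


def cond(world, b):
    """Evaluate a Cond node such as ("frontIsClear",) or ("not", b)."""
    if b[0] == "not":
        return not cond(world, b[1])
    count = world.markers.get(world.pos, 0)
    return {
        "frontIsClear": world.is_clear(0),
        "leftIsClear": world.is_clear(-1),
        "rightIsClear": world.is_clear(1),
        "markersPresent": count > 0,
        "noMarkersPresent": count == 0,
    }[b[0]]


def act(world, name):
    """Apply one Action; invalid moves and picks are ignored (assumption)."""
    if name == "move":
        if world.is_clear(0):
            d_row, d_col = HEADINGS[world.heading]
            world.pos = (world.pos[0] + d_row, world.pos[1] + d_col)
    elif name == "turnRight":
        world.heading = (world.heading + 1) % 4
    elif name == "turnLeft":
        world.heading = (world.heading - 1) % 4
    elif name == "putMarker":
        world.markers[world.pos] = world.markers.get(world.pos, 0) + 1
    elif name == "pickMarker":
        if world.markers.get(world.pos, 0) > 0:
            world.markers[world.pos] -= 1
    else:
        raise ValueError(f"unknown action: {name}")


def execute(world, stmt, budget):
    """Run a Stmt node; `budget` is a one-element list capping while-iterations."""
    kind = stmt[0]
    if kind == "seq":
        for child in stmt[1:]:
            execute(world, child, budget)
    elif kind == "while":
        while cond(world, stmt[1]):
            budget[0] -= 1
            if budget[0] < 0:
                raise RuntimeError("step budget exceeded")
            execute(world, stmt[2], budget)
    elif kind == "repeat":
        for _ in range(stmt[1]):
            execute(world, stmt[2], budget)
    elif kind == "if":
        if cond(world, stmt[1]):
            execute(world, stmt[2], budget)
    elif kind == "ifelse":
        execute(world, stmt[2] if cond(world, stmt[1]) else stmt[3], budget)
    else:
        act(world, kind)


# def run(): while(frontIsClear()): move() ; putMarker()
demo_world = World(rows=1, cols=4)
demo_program = ("seq", ("while", ("frontIsClear",), ("move",)), ("putMarker",))
execute(demo_world, demo_program, budget=[100])
```

In this sketch the robot starts in the leftmost cell of a 1x4 corridor facing east, walks until the boundary blocks it, and drops one marker in the last cell.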


2 Related Work 


With the advent of neural program synthesis [1], the models are able to work on different 
domains with little modification and are more robust towards giving generalizable results. Initial 
works in neural program synthesis utilized neural networks to prune the search space for traditional 
program synthesis techniques 09} [25], but then [3] introduced neural models using LSTM layers 
capable of generating programs. Subsequent works improved upon Bunel et al.’s work while keeping 
the network architecture relatively unchanged. [32] showed that using the information from traces of
individual examples helps the model generate better programs. [7] [6] [24] devised techniques
to execute the partially generated programs on the inputs to obtain the intermediate states that help
the model better estimate the next tokens to be generated. While [6] uses methods to
sample executable partial programs from the output beams to perform search, other work tries to execute
partial programs in the latent space, removing the need for algorithmic methods of sampling partial
programs. Gupta et al. [16] improve upon execution guided search by introducing a neural debugger
to repair the generated programs.


The introduction of Transformers in Natural Language Processing (NLP) has been revolutionary.
Transformers overcame pitfalls of LSTMs and GRUs, such as exploding and vanishing gradients and
weakness in modelling long-term dependencies, by using a multi-head attention mechanism that can process
the input tokens in parallel rather than sequentially. With time, larger and more capable Transformer
models were introduced that improved the performance of neural models in different problem
statements within NLP [9] [23] [21] [22] [30] and produced comparable performance in computer vision
and speech processing as well [5] [35] [36].


Recently, a lot of attention has been paid to problems such as source code generation, code
completion, function naming, bug fixing, etc., in programming languages, and impressive results
have been produced with transformer models. [33] is a general purpose code completion tool for multiple
programming languages; transformers have also been used for natural language to code generation;
[30] introduced an unsupervised neural transcompiler that learns to translate programs between
C++, Java and Python even in the absence of a parallel training corpus; and a BERT based
architecture has been utilized for natural language code search and code documentation generation.
Scholak, Schucher, and Bahdanau [31] show that vanilla transformers combined with some execution
guided search perform well in synthesizing SQL queries. ProgRES [1], a recently introduced large scale
few-shot program induction benchmark on C++ programs, uses a transformer architecture (BART) as
a baseline model for synthesizing programs. This shows that transformers can capture the syntactical
information of the programming languages they are trying to model. The results show that large
language models are capable of solving programming tasks without the need for additional
contextual information.
Figure 2: Model Architecture: The architecture is based on [3].


3 Approach 


3.1 Dataset 


The Karel dataset consists of a train/validation/test split, with around 1.1 million synthetically generated
programs written in the Karel language in the training set and 2.5k programs each in the validation
and test sets. Each program is accompanied by 6 input-output (IO) pairs. These IO pairs are image
representations of the state of the Karel robot in the grid world. We use the programs and IO pairs in
the training set to train our model, and report results by running our inference procedure on the IO
samples in the test set.


3.2 Model Architecture 


Our input-output encoding is handled by a convolutional neural network as described in [3]. For our
transformer decoder, we used the GPT2 architecture with 6 layers as our base model. We downloaded
an existing model as our starting point for the transformer, but initialized both the IO
encoder and GPT2 decoder layers with random weights, which are then learned end-to-end. Our
model architecture is shown in Figure 2.


3.3 Training Hyperparameters 


Following [3], we set up the task as supervised sequence generation conditioned on a set of input-
output representations. The decoder is fed a combined representation of the IO encoding and the
output of the previous timestep. We used the Adam optimizer with batch size 8 (due to computational
constraints) and an initial learning rate of 1e-4. We used a linear learning rate scheduler with a warmup
of 139606 steps (around 1 epoch) and gradient clipping to ensure that the transformer model doesn't
diverge during training, and trained for 20 epochs. The model was trained on a DGX-1 machine
with an Intel Xeon E5-2698 CPU, 32GB of system memory available to the training job, and an
Nvidia Tesla V100 GPU, using an allocation of 16GB VRAM.
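
The schedule described above can be sketched in a few lines of plain Python. The warmup length, peak learning rate, batch size and epoch count come from the text. The linear decay to zero after warmup, the assumption that one epoch is exactly the warmup length, and the clipping threshold in the usage line are assumptions, since the text gives none of them.

```python
import math

WARMUP_STEPS = 139_606   # from the text, roughly one epoch
BATCH_SIZE = 8           # from the text
EPOCHS = 20              # from the text
PEAK_LR = 1e-4           # from the text

# Assumption: one epoch is exactly WARMUP_STEPS optimizer steps.
TOTAL_STEPS = WARMUP_STEPS * EPOCHS


def linear_schedule_lr(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warmup from 0 to peak_lr, then linear decay to 0 (decay is assumed)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]


# Examples seen during warmup: 139606 steps * 8 per batch = 1,116,848,
# which is consistent with a training set of around 1.1 million programs.
examples_in_warmup = WARMUP_STEPS * BATCH_SIZE

# Hypothetical threshold of 1.0, chosen only to show the rescaling.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Under these assumptions the learning rate is 5e-5 halfway through warmup, peaks at 1e-4 after step 139606, and reaches zero at the end of epoch 20.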


Method                                      Top-1 Gen    Exact Match

Leveraging Grammar + LSTM                   73.67%       39.94%
IO->Trace->Program [32]                     81.30%       42.80%
Latent Execution                            83.68%       41.12%
Execution Guided Decoding [6]               86.04%       40.88%
Learning to Repair                          89.80%       43.48%
Execution Guided Decoding (Ensemble) [6]    92.00%       47.08%

LSTM Decoder                                73.48%       40.88%
LSTM + SEG                                  84.48%       43.28%
Transformer Decoder                         82.40%       43.36%
Transformer + SEG                           89.64%       44.80%
Transformer + Debugger                      90.44%       44.88%


Table 1: Results on Karel test dataset. 


3.4 Simple Execution Guidance for Search 


During inference, we take the top ranked beam output of the model as the chosen program for
evaluation. When adding simple execution guidance (SEG), we sample the top 50 beams of the
model, and use the Karel interpreter to evaluate them on the 5 IO pairs. We take the top ranked
program which satisfies all 5 pairs, and then evaluate it against the 6th held out IO sample.
If no beam satisfies all 5 pairs, we return a NULL program, which counts as an incorrect
output during evaluation. This method of using the interpreter as guidance for beam selection is more
rudimentary than methods detailed in the past [6] and is similar to the greedy search employed by
[16]. However, since we do not execute partial programs, we do not need to add measures to deal
with issues surrounding sampling for looping and branching constructs.
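
The selection rule described above can be sketched as follows. This is an illustration and not the code used for the experiments. The function names, the `run` callback standing in for the Karel interpreter, and the toy integer "programs" in the usage lines are hypothetical. Treating an interpreter exception as a failed pair is also an assumption.

```python
def satisfies_all(program, io_pairs, run):
    """True if run(program, input) reproduces the output of every IO pair."""
    for grid_in, grid_out in io_pairs:
        try:
            if run(program, grid_in) != grid_out:
                return False
        except Exception:
            # Assumption: a program that crashes cannot satisfy the pair.
            return False
    return True


def select_with_seg(beams, io_pairs, run):
    """Return the highest-ranked beam consistent with all IO pairs, else None.

    `beams` must be ordered from most to least likely. None plays the role
    of the NULL program and is scored as incorrect during evaluation.
    """
    for program in beams:
        if satisfies_all(program, io_pairs, run):
            return program
    return None


# Toy stand-in for the interpreter: a program is a list of integer operations.
OPS = {"inc": lambda x: x + 1, "double": lambda x: x * 2}


def toy_run(program, value):
    for op in program:
        value = OPS[op](value)
    return value


toy_beams = [["inc"], ["double", "inc"], ["inc", "double"]]
toy_pairs = [(1, 4), (2, 6), (0, 2)]
chosen = select_with_seg(toy_beams, toy_pairs, toy_run)
```

Here the first two beams fail on the first pair, so the third beam is returned even though the model ranked it lowest. Because only complete programs are executed, the loop needs no special handling for partially generated loops or branches.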


4 Results 


Table 1 shows our top-1 generalisation and exact match accuracies on the Karel test dataset. Our
generalisation accuracy is calculated by using the program obtained by feeding 5 input-output pairs
to the model and evaluating this against the 6th held out IO sample. We compare our results against
previous approaches published for program synthesis. Chen, Liu, and Song [6] report results for both
a single decoder model, as well as using ensemble approaches. We have reported both results here,
however our approach only uses a single model during inference. For SEG results, this accuracy is
calculated on the program selected after searching the beams. We observe a generalisation accuracy of
82.4% on our base transformer model which increases to 89.64% after adding the execution guidance.
We also observe an exact match accuracy of 44.8% after performing SEG during inference.
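
The two metrics can be illustrated with a small sketch. It follows the description in the text: a prediction generalises if it reproduces the held out output, and it is an exact match if it equals the ground truth program token for token. The function names and the toy integer programs are hypothetical, and a NULL prediction is represented by `None`.

```python
def generalises(program, held_out_pair, run):
    """True if a non-NULL program reproduces the held out output."""
    if program is None:
        return False
    grid_in, grid_out = held_out_pair
    try:
        return run(program, grid_in) == grid_out
    except Exception:
        return False  # assumption: a crash counts as an incorrect output


def evaluate(predictions, references, held_out_pairs, run):
    """Return (generalisation accuracy, exact match accuracy) as fractions."""
    total = len(references)
    gen = sum(generalises(p, pair, run) for p, pair in zip(predictions, held_out_pairs))
    exact = sum(p is not None and list(p) == list(r) for p, r in zip(predictions, references))
    return gen / total, exact / total


# Toy stand-in for the interpreter: a program is a list of integer operations.
OPS = {"inc": lambda x: x + 1, "double": lambda x: x * 2}


def toy_run(program, value):
    for op in program:
        value = OPS[op](value)
    return value


toy_predictions = [["inc", "double"], ["inc", "inc"], None]
toy_references = [["inc", "double"], ["double"], ["inc"]]
toy_held_out = [(1, 4), (2, 4), (5, 6)]
gen_acc, exact_acc = evaluate(toy_predictions, toy_references, toy_held_out, toy_run)
```

The second toy prediction maps 2 to 4 just as its reference does, so it counts towards generalisation without being an exact match. This is why generalisation accuracy is always at least as high as exact match accuracy.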


We also conducted experiments to validate the advantage of using a transformer model with the current
state-of-the-art, which is the SED approach by Gupta et al. [16]. We fed the output beams of our transformer
model to the debugger LSTM trained without fine-tuning as per [16]. We used an implementation
available on GitHub [05]. With this approach, listed as Transformer + Debugger in Table 1, we
achieved a top-1 generalisation accuracy of 90.44%. Note that we did not perform the
fine-tuning step described by Gupta et al. due to memory constraints in our system.


4.1 Error Analysis 


While analyzing the results on the test set of 2500 programs, we found that in 86.4% of cases (2160
instances), the execution guidance selected the top ranked beam from the model output. In 124
cases, our SEG inference method returned a NULL program, indicating that no beam satisfied all 5
input-output pairs. Figure 3a shows that the accuracy of the model decreases as the length of the ground
truth program increases. Figure 3b shows that in the majority of cases, the programs generated by the
model are shorter than the ground truth program. This bias could perhaps be rectified by balancing
the length of the programs in the dataset. Figure 3c shows that SEG seems to choose longer programs
when the top beam doesn't satisfy all 6 IO pairs. Figure 3d analyzes the programs that satisfied all
6 IO pairs, while not being an exact match to the ground truth. It shows the number of programs
that have the same trace as the target program for N IO pairs. This shows that the gap between the
generalization and exact match accuracies could be explained by the fact that two programs can
have the same trace for the IO pairs despite not being an exact match. Figure 3e shows that the
accuracy of the model increases as more IO pairs are provided to the model. Figure 3f shows that the
accuracy of the model shows a rising trend as the number of beams searched during SEG is increased.


[Figure 3 panel titles recovered from the source: Accuracy vs Length of Target Program; Distribution of Program Length (GPT2)]
Figure 3: Analysis of the model output.
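
The counts in this section can be cross-checked against Table 1 with a little arithmetic. The figures 2500, 2160 and 124 and the two accuracies are from the text. Everything computed from them below is a derived quantity and not a reported result, and it assumes that the percentages in Table 1 are taken over all 2500 test programs and that every exact match also generalises.

```python
TEST_SIZE = 2500           # from the text
TOP_BEAM_SELECTED = 2160   # from the text
NULL_PROGRAMS = 124        # from the text
SEG_GEN_ACC = 0.8964       # Transformer + SEG, Table 1
SEG_EXACT_ACC = 0.4480     # Transformer + SEG, Table 1


def share(count, total=TEST_SIZE):
    """Fraction of the test set covered by `count` programs."""
    return count / total


def as_count(accuracy, total=TEST_SIZE):
    """Number of test programs corresponding to an accuracy."""
    return round(accuracy * total)


# Programs for which some beam satisfied all 5 observed IO pairs.
non_null = TEST_SIZE - NULL_PROGRAMS                       # 2376
# Cases where SEG picked a beam ranked below the top one.
lower_beam_selected = non_null - TOP_BEAM_SELECTED         # 216
# Ceiling on SEG generalisation accuracy with 50 beams: NULL outputs are wrong.
seg_upper_bound = share(non_null)                          # 0.9504

generalising = as_count(SEG_GEN_ACC)                       # 2241
exact = as_count(SEG_EXACT_ACC)                            # 1120
# Programs that generalise without matching the ground truth exactly.
generalising_not_exact = generalising - exact              # 1121
# Programs that satisfied the 5 observed pairs but failed the held out pair.
passed_five_failed_sixth = non_null - generalising         # 135
```

Under these assumptions, 216 selections came from a lower ranked beam, and 135 selected programs satisfied the 5 observed pairs but failed the held out one.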


5 Conclusion 


We report results for program synthesis on Karel by using transformers as the base architecture,
replacing the LSTM-based models that have been reported earlier. We show that this change in
architecture results in a significant improvement in generalisation accuracy. Further, by
performing even a simple execution guided search over the beam outputs, we are able to achieve performance
on par with more complex approaches.